AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This success has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the conversion ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objectives are to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
ID: Customer ID
Age: Customer's age in completed years
Experience: Number of years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIPCode: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
Mortgage: Value of house mortgage, if any (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
# Install libraries for ZIP code lookup
!pip install uszipcode
# Import libraries for data manipulation
import numpy as np
import pandas as pd
# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Import libraries for ZIP code lookup
import uszipcode as zip  # caution: this alias shadows the built-in zip()
# Import libraries for error handling
import warnings
# Import libraries for ML-scikit-learn
import sklearn.model_selection as sms
import sklearn.tree as ste
import sklearn.metrics as smt
#Import libraries for statistics
import scipy.stats as stt
# Apply settings
# Ignore warnings
warnings.filterwarnings('ignore')
# Remove the limit for the displayed columns in a DataFrame
pd.set_option('display.max_columns', None)
# Set precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Read the file
df = pd.read_csv('/content/alllife_customer.csv')
# Display the rows and shape of the DataFrame
df
# Display the summary of the DataFrame
df.info()
# Display how many duplicate records are present in the dataset
df.duplicated().sum()
# Display the records with duplicate customer IDs
df[df['ID'].duplicated()]
# Display unique values in the dataset
df.nunique()
# Display the statistical summary of all columns
df.describe(include = 'all').transpose()
# Display again the summary of the DataFrame
df.info()
# Create a list of all categorical columns and numerical columns for additional analysis
cat_cols = ['ZIPCode', 'Family', 'Education', 'Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
num_cols = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
# Display the count of unique categorical values in each column
for cat_col in cat_cols:
print(df[cat_col].value_counts())
print('-' * 50)
# Display the percentage of unique categorical values in each column
for cat_col in cat_cols:
print(df[cat_col].value_counts(normalize = True).mul(100))
print('-' * 50)
Observations:
# User-defined functions
def show_boxplot_histplot(data, feature, hue = None, figsize = (11.75, 7), kde = True, bins = None):
"""
Description: Function to plot a boxplot and a histogram along the same scale
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
feature: str, required
Name of the feature column
hue: str, optional
To show the hue, default: None
figsize: tuple, optional
The figure size in inches, default: (11.75, 7)
kde: bool, optional
To show the kernel density estimate, default: True
bins: int, optional
The number of bins for histogram, default: None
"""
# Creating the 2 subplots
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows = 2,
sharex = True,
gridspec_kw = {'height_ratios': (0.25, 0.75)},
figsize = figsize
)
# Adjust the subplot layout parameters
f2.subplots_adjust(hspace = 0.25)
# Create the boxplot with a star to indicate the mean value of the column
sns.boxplot(
data = data,
x = feature,
hue = hue,
ax = ax_box2,
showmeans = True,
color = 'violet'
)
# Create the histogram
if bins:
sns.histplot(
data = data,
x = feature,
hue = hue,
kde = kde,
ax = ax_hist2,
bins = bins,
palette = 'winter'
)
else:
sns.histplot(
data = data,
x = feature,
hue = hue,
kde = kde,
ax = ax_hist2
)
# Add mean to the histogram
ax_hist2.axvline(data[feature].mean(), color = 'green', linestyle = '--')
# Add median to the histogram
ax_hist2.axvline(data[feature].median(), color = 'black', linestyle = '-')
# Set title
ax_box2.set_title(('Boxplot of ' + feature), fontsize = 11)
ax_hist2.set_title(('Distribution of ' + feature), fontsize = 11)
def show_countplot(data, feature, hue = None, n = None, ascending = False, figsize = (11.75, 5)):
"""
Description: Function to plot a barplot with labeled percentage or count
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
feature: str, required
Name of the feature column
hue: str, optional
To show the hue, default: None
n: int, optional
To show the top n category levels, default: None (display all levels)
ascending: bool, optional
To sort the bar by count, default: False
figsize: tuple, optional
The figure size in inches, default: (11.75, 5)
"""
total = len(data[feature])
order = data[feature].value_counts().index.tolist()[:n]
if ascending == True:
order.reverse()
plt.figure(figsize = figsize)
plt.xticks(rotation = 90)
plt.xlim(0, data[feature].value_counts().tolist()[0] * 1.5)
ax = sns.countplot(
data = data,
y = feature,
hue = hue,
palette = 'Paired',
order = order,
)
ax.set_title('Number of ' + feature, fontsize = 11)
for patch in ax.patches:
x = patch.get_x() + patch.get_width()
y = patch.get_y() + patch.get_height() / 1.75
cnt = ('{:.0f}').format(patch.get_width())
pct = ('{:.2f}%').format(100 * patch.get_width() / total)
ax.annotate(
cnt + ' (' + pct + ')',
(x, y),
ha = 'left',
va = 'center',
xytext = (0, 2.5),
textcoords = 'offset points'
)
plt.show()
def get_outliers(data, feature):
"""
Description: Function that will return the outliers from a DataFrame
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
feature: str, required
Name of the feature column
"""
q1 = data[feature].quantile(0.25)
q3 = data[feature].quantile(0.75)
iqr = q3 - q1
data = data[((data[feature] < (q1 - 1.5 * iqr)) | (data[feature] > (q3 + 1.5 * iqr)))]
return data
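The 1.5 * IQR fences computed by `get_outliers` can be illustrated on a toy column (hypothetical values chosen so the bounds are easy to verify by hand):

```python
import pandas as pd

# Hypothetical income values: one extreme entry far above the rest
toy = pd.DataFrame({'Income': [40, 42, 45, 48, 50, 52, 55, 58, 60, 200]})

q1 = toy['Income'].quantile(0.25)            # 45.75
q3 = toy['Income'].quantile(0.75)            # 57.25
iqr = q3 - q1                                # 11.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 28.5 and 74.5

# Rows outside the fences are flagged as outliers
outliers = toy[(toy['Income'] < lower) | (toy['Income'] > upper)]
print(outliers['Income'].tolist())  # [200]
```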
def show_pairplot(data, diag_kind = 'kde', height = 2, hue = None):
"""
Description: Function to plot a pairplot of the numerical variables
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
diag_kind: str, optional
The type of the diagonal subplots, default: kde
height: int, optional
The height of each facet in inches, default: 2
hue: str, optional
To show the hue, default: None
"""
if hue:
ax = sns.pairplot(data = data, diag_kind = diag_kind, height = height, hue = hue)
ax.fig.suptitle('Relationship of Numerical Variables with regards to ' + hue, y = 1.005, size = 11)
else:
ax = sns.pairplot(data = data, diag_kind = diag_kind, height = height)
ax.fig.suptitle('Relationship of Numerical Variables', y = 1.005, size = 11)
plt.show()
def show_heatmap(data, figsize = (12, 9), cmap = 'Spectral', annot = True, vmin = -1, vmax = 1, fmt = '.2f'):
"""
Description: Function to plot a heatmap of the correlations of numerical variables
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
figsize: tuple, optional
The figure size in inches, default: (12, 9)
cmap: str, optional
The color map name, default: Spectral
annot: bool, optional
To annotate each cell with its value, default: True
vmin: float, optional
The minimum value to anchor the color map, default: -1
vmax: float, optional
The maximum value to anchor the color map, default: 1
fmt: str, optional
The formatting used in the annotation, default: .2f
"""
plt.figure(figsize = figsize)
ax = sns.heatmap(data.corr(), annot = annot, vmin = vmin, vmax = vmax, fmt = fmt, cmap = cmap)
ax.set_title("Correlation of Numerical Variables", fontsize = 11)
plt.show()
def show_distplot_boxplot(data, feature, target, figsize = (10, 7)):
"""
Description: Function to plot a histogram and a boxplot with hue along the same scale
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
feature: str, required
Name of the feature column
target: str, required
To show the diagrams based on the target's value
figsize: tuple, optional
The figure size in inches, default: (10, 7)
"""
fig, axs = plt.subplots(2, 2, figsize = figsize)
target_uniq = data[target].unique()
axs[0, 0].set_title('Distribution of ' + feature + ' for ' + target + ' = ' + str(target_uniq[0]), fontsize = 11)
sns.histplot(
data = data[data[target] == target_uniq[0]],
x = feature,
kde = True,
ax = axs[0, 0],
color = 'teal',
stat = 'density',
)
axs[0, 1].set_title('Distribution of ' + feature + ' for ' + target + ' = ' + str(target_uniq[1]), fontsize = 11)
sns.histplot(
data = data[data[target] == target_uniq[1]],
x = feature,
kde = True,
ax = axs[0, 1],
color = 'orange',
stat = 'density',
)
axs[1, 0].set_title('Boxplot of ' + feature + ' w/ regards to ' + target, fontsize = 11)
sns.boxplot(
data = data,
x = target,
y = feature,
ax = axs[1, 0],
palette = 'gist_rainbow'
)
axs[1, 1].set_title('Boxplot (w/o outliers) of ' + feature + ' w/ regards to ' + target, fontsize = 11)
sns.boxplot(
data = data,
x = target,
y = feature,
ax = axs[1, 1],
showfliers = False,
palette = 'gist_rainbow'
)
plt.tight_layout()
plt.show()
def show_stackedbarplot(data, feature, target, figsize = (5, 3)):
"""
Description: Function to plot a stacked barplot with hue within the same bar
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
feature: str, required
Name of the feature column
target: str, required
To display the hue within the same plot using the target's value
figsize: tuple, optional
The figure size in inches, default: (5, 3)
"""
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[feature], data[target], margins = True)
display(tab1)
tab2 = pd.crosstab(data[feature], data[target], normalize = 'index').sort_values(by = sorter, ascending = False)
ax = tab2.plot(kind = 'bar', stacked = True, figsize = figsize)
ax.set_title("Stacked Barplot of " + feature + ' w/ regards to ' + target, fontsize = 11)
plt.legend(loc = 'upper left', bbox_to_anchor = (1, 1), frameon = False)
plt.show()
def show_pointplot(data, feature, category, target, estimator = 'mean', figsize = (5, 3)):
"""
Description: Function to plot a pointplot with a categorical variable and hue
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
feature: str, required
Name of the feature column
category: str, required
Name of the categorical column
target: str, required
To display the hue within the same plot using the target's value
estimator: str, required
The calculated central tendency of the feature, default: mean
figsize: tuple, optional
The figure size in inches, default: (5, 3)
"""
plt.figure(figsize = figsize)
ax = sns.pointplot(data = data, y = feature, x = category, hue = target, estimator = estimator)
ax.set_title("Pointplot of " + feature + " per " + category + ' w/ regards to ' + target, fontsize = 11)
plt.show()
def show_significance(data, target, significance_level = 0.05):
"""
Description: Function to show the significance of each feature variable vs the target variable
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
target: str, required
The target variable against which each feature is tested
significance_level: float, optional
The significance level where the p_value will be compared with, default: 0.05
"""
for feature in list(data.columns):
if target != feature:
crosstab = pd.crosstab(data[target], data[feature])
chi, p_value, dof, expected = stt.chi2_contingency(crosstab)
if p_value < significance_level:
print("*", feature, "has an effect on", target, "as the p_value", p_value.round(3), "< significance_level", significance_level)
else:
print(" ", feature, "has no effect on", target, "as the p_value", p_value.round(3), ">= significance_level", significance_level)
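The chi-square test of independence behind `show_significance` can be sketched on hypothetical data: a binary flag that clearly moves with a binary target yields a small p-value.

```python
import pandas as pd
import scipy.stats as stt

# Hypothetical data: the flag is far more common when target == 1
toy = pd.DataFrame({
    'target': [0] * 40 + [1] * 10,
    'flag':   [0] * 36 + [1] * 4 + [0] * 2 + [1] * 8,
})

# Build the contingency table and test independence
crosstab = pd.crosstab(toy['target'], toy['flag'])
chi, p_value, dof, expected = stt.chi2_contingency(crosstab)
print(dof, p_value < 0.05)  # 1 degree of freedom; the association is significant
```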
# Display the distribution of customers with regards to Age
show_boxplot_histplot(data = df, feature = 'Age')
Observations:
# Display the distribution of customers with regards to Experience
show_boxplot_histplot(data = df, feature = 'Experience')
# Display less than 0 years of Experience
df[df['Experience'] < 0]['Experience'].value_counts()
# Experience contains invalid negative values (i.e. -1, -2 and -3), likely due to errors during data entry
# Impute them with the median of the valid (non-negative) values
median_experience = df.loc[df['Experience'] >= 0, 'Experience'].median()
df.loc[df['Experience'] < 0, 'Experience'] = median_experience
df['Experience'] = df['Experience'].astype(int)
df
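The median-imputation pattern used above, traced on a hypothetical series with negative data-entry errors, computing the median from the valid entries only:

```python
import pandas as pd

# Hypothetical years-of-experience column with invalid negative entries
exp = pd.Series([-1, 3, 5, -2, 10, 7])

# Median of the valid (non-negative) values: median of [3, 5, 7, 10] -> 6.0
median_exp = exp[exp >= 0].median()

# Replace the negative entries with the median, then restore integer dtype
exp_clean = exp.mask(exp < 0, median_exp).astype(int)
print(exp_clean.tolist())  # [6, 3, 5, 6, 10, 7]
```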
# Display the distribution of customers with regards to Experience (after imputation)
show_boxplot_histplot(data = df, feature = 'Experience')
Observations:
# Display the distribution of customers with regards to Income
show_boxplot_histplot(data = df, feature = 'Income')
# Create a dataset of outliers for Income
df_income_outliers = get_outliers(data = df, feature = 'Income')
df_income_outliers
Observations:
# Display the distribution of customers with regards to CCAvg
show_boxplot_histplot(data = df, feature = 'CCAvg')
# Create a dataset of outliers for CCAvg
df_ccavg_outliers = get_outliers(data = df, feature = 'CCAvg')
df_ccavg_outliers
Observations:
# Display the distribution of customers with regards to Mortgage
show_boxplot_histplot(data = df, feature = 'Mortgage')
# Create a dataset of outliers for Mortgage
df_mortgage_outliers = get_outliers(data = df, feature = 'Mortgage')
df_mortgage_outliers
Observations:
# Look up each ZIP code once and reuse the result for both derived columns
search = zip.SearchEngine()
zip_lookup = {code: search.by_zipcode(code) for code in df['ZIPCode'].unique()}
# Create a new column with the City and State
df['City_State'] = df['ZIPCode'].map(lambda code: zip_lookup[code].major_city + ', ' + zip_lookup[code].state if zip_lookup[code] else np.nan)
# Create a new column with the City, State and ZIP code
df['City_State_ZIPCode'] = df['City_State'] + ' ' + df['ZIPCode'].astype(str)
# Display the number of customers with regards to City_State_ZIPCode (filtered to top 20)
show_countplot(data = df, feature = 'City_State_ZIPCode', n = 20, figsize = (11.75, 15))
# Display the number of customers with regards to City_State (filtered to top 20)
show_countplot(data = df, feature = 'City_State', n = 20, figsize = (11.75, 15))
Observations:
# Display the number of customers with regards to Family
show_countplot(data = df, feature = 'Family', figsize = (11.75, 4))
Observations:
# Display the number of customers with regards to Education
show_countplot(data = df, feature = 'Education', figsize = (11.75, 3))
Observations:
# Display the number of customers with regards to Personal_Loan
show_countplot(data = df, feature = 'Personal_Loan', figsize = (11.75, 2))
Observations:
# Display the number of customers with regards to Securities_Account
show_countplot(data = df, feature = 'Securities_Account', figsize = (11.75, 2))
Observations:
# Display the number of customers with regards to CD_Account
show_countplot(data = df, feature = 'CD_Account', figsize = (11.75, 2))
Observations:
# Display the number of customers with regards to Online
show_countplot(data = df, feature = 'Online', figsize = (11.75, 2))
Observations:
# Display the number of customers with regards to CreditCard
show_countplot(data = df, feature = 'CreditCard', figsize = (11.75, 2))
Observations:
# Display the relationships of Numerical Variables
show_heatmap(data = df[num_cols], figsize = (15, 11))
# Display the relationships of Numerical Variables with regards to Personal_Loan
show_pairplot(data = df[num_cols + ['Personal_Loan']], hue = 'Personal_Loan')
Observations:
# Display the distribution of Age with regards to Personal_Loan
show_distplot_boxplot(data = df, feature = 'Age', target = 'Personal_Loan')
# Display the distribution of Age per Family with regards to Personal_Loan
show_pointplot(data = df, feature = 'Age', category = 'Family', target = 'Personal_Loan')
# Display the distribution of Age per Education with regards to Personal_Loan
show_pointplot(data = df, feature = 'Age', category = 'Education', target = 'Personal_Loan')
Observations:
# Display the distribution of Experience with regards to Personal_Loan
show_distplot_boxplot(data = df, feature = 'Experience', target = 'Personal_Loan')
# Display the distribution of Experience per Family with regards to Personal_Loan
show_pointplot(data = df, feature = 'Experience', category = 'Family', target = 'Personal_Loan')
# Display the distribution of Experience per Education with regards to Personal_Loan
show_pointplot(data = df, feature = 'Experience', category = 'Education', target = 'Personal_Loan')
Observations:
# Display the distribution of Income with regards to Personal_Loan
show_distplot_boxplot(data = df, feature = 'Income', target = 'Personal_Loan')
# Display the distribution of Income per Family with regards to Personal_Loan
show_pointplot(data = df, feature = 'Income', category = 'Family', target = 'Personal_Loan')
# Display the distribution of Income per Education with regards to Personal_Loan
show_pointplot(data = df, feature = 'Income', category = 'Education', target = 'Personal_Loan')
Observations:
# Display the distribution of CCAvg with regards to Personal_Loan
show_distplot_boxplot(data = df, feature = 'CCAvg', target = 'Personal_Loan')
# Display the distribution of CCAvg per Family with regards to Personal_Loan
show_pointplot(data = df, feature = 'CCAvg', category = 'Family', target = 'Personal_Loan')
# Display the distribution of CCAvg per Education with regards to Personal_Loan
show_pointplot(data = df, feature = 'CCAvg', category = 'Education', target = 'Personal_Loan')
# Check if CCAvg only accounts for AllLife bank credit card and not the credit card issued by other banks
df[(df['CCAvg'] == 0) & (df['CreditCard'] == 1)]['ID'].count() > 0
# Display the list of customers that do not use the AllLife Bank credit card
id_ccavg = df[df['CCAvg'] == 0]['ID']
id_ccavg
# Display the list of customers that uses the credit card issued by other banks
id_cc1 = df[df['CreditCard'] == 1]['ID']
id_cc1
# Check if all customers that do not use the AllLife Bank credit card use a credit card issued by other banks
# Note: count the matches with .sum(); .count() tallies all non-null entries and would always equal the length
id_ccavg.isin(id_cc1).sum() == id_ccavg.count()
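A note on membership checks: `Series.isin` returns a boolean mask, so matches should be counted with `.sum()`; `.count()` tallies every non-null entry whether it matches or not, and so cannot detect non-members. A minimal sketch with hypothetical IDs:

```python
import pandas as pd

# Hypothetical ID lists: only 2 and 3 appear in both
a = pd.Series([1, 2, 3])
b = pd.Series([2, 3, 9])

mask = a.isin(b)
# .sum() counts the True values; .count() counts all non-null entries
print(mask.sum(), mask.count())  # 2 3
```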
# Display the number of customers that have a credit card (both AllLife bank and other banks)
df[(df['CCAvg'] > 0) | (df['CreditCard'] == 1)]['ID'].count()
Observations:
# Display the distribution of Mortgage with regards to Personal_Loan
show_distplot_boxplot(data = df, feature = 'Mortgage', target = 'Personal_Loan')
# Display the distribution of Mortgage per Family with regards to Personal_Loan
show_pointplot(data = df, feature = 'Mortgage', category = 'Family', target = 'Personal_Loan')
# Display the distribution of Mortgage per Education with regards to Personal_Loan
show_pointplot(data = df, feature = 'Mortgage', category = 'Education', target = 'Personal_Loan')
Observations:
# Display the distribution of ZIPCode with regards to Personal_Loan (filtered to top 20)
show_stackedbarplot(data = df[df['City_State_ZIPCode'].isin(df['City_State_ZIPCode'].value_counts().head(20).index)], feature = 'City_State_ZIPCode', target = 'Personal_Loan')
# Display the distribution of City_State with regards to Personal_Loan (filtered to top 20)
show_stackedbarplot(data = df[df['City_State'].isin(df['City_State'].value_counts().head(20).index)], feature = 'City_State', target = 'Personal_Loan')
Observations:
# Display the distribution of Family with regards to Personal_Loan
show_stackedbarplot(data = df, feature = 'Family', target = 'Personal_Loan')
Observations:
# Display the distribution of Education with regards to Personal_Loan
show_stackedbarplot(data = df, feature = 'Education', target = 'Personal_Loan')
Observations:
# Display the distribution of Securities_Account with regards to Personal_Loan
show_stackedbarplot(data = df, feature = 'Securities_Account', target = 'Personal_Loan')
Observations:
# Display the distribution of CD_Account with regards to Personal_Loan
show_stackedbarplot(data = df, feature = 'CD_Account', target = 'Personal_Loan')
Observations:
# Display the distribution of Online with regards to Personal_Loan
show_stackedbarplot(data = df, feature = 'Online', target = 'Personal_Loan')
Observations:
# Display the distribution of CreditCard with regards to Personal_Loan
show_stackedbarplot(data = df, feature = 'CreditCard', target = 'Personal_Loan')
Observations:
# Display the significance of each feature with regards to Personal_Loan
show_significance(data = df, target = 'Personal_Loan')
Observations:
# Display the statistical summary of all columns for those who accepted a loan
df[df['Personal_Loan'] == 1].describe(include = 'all').transpose()
# Display the statistical summary of all columns for those who did not accept a loan
df[df['Personal_Loan'] == 0].describe(include = 'all').transpose()
# User-defined functions
def show_boxplot_outliers(data, num_cols, figsize = (10, 10)):
"""
Description: Function to plot multiple boxplots to display the outliers
Parameters:
data: pandas.core.frame.DataFrame, required
The DataFrame of the two-dimensional tabular data
num_cols: list, required
The column names of numeric columns
figsize: tuple, optional
The figure size in inches, default: (10, 10)
"""
plt.figure(figsize = figsize)
for i, variable in enumerate(num_cols):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis = 1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# Drop columns that are not needed in the model building
if 'ID' in df.columns:
df.drop(['ID'], axis = 1, inplace = True)
if 'City_State' in df.columns:
df.drop(['City_State'], axis = 1, inplace = True)
if 'City_State_ZIPCode' in df.columns:
df.drop(['City_State_ZIPCode'], axis = 1, inplace = True)
df
Observations:
# Display if there are outliers in the numerical columns
show_boxplot_outliers(data = df, num_cols = num_cols)
Observations:
# Create independent variable
x = df.drop(['Personal_Loan'], axis = 1)
x
# Create dependent variable
y = df['Personal_Loan']
y
# Split train and test data
x_train, x_test, y_train, y_test = sms.train_test_split(x, y, test_size = 0.3, random_state = 1)
# Display the shape of train data
x_train.shape
# Display the shape of test data
x_test.shape
# Display the percentage of classes in train data
y_train.value_counts(normalize = True)
# Display the percentage of classes in test data
y_test.value_counts(normalize = True)
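The split above preserves the ~9:1 class ratio only by chance; passing `stratify = y` guarantees it in both subsets, which matters when the positive class is rare. A minimal sketch with a hypothetical 90:10 label series:

```python
import pandas as pd
import sklearn.model_selection as sms

# Hypothetical imbalanced labels: 10% positives
y = pd.Series([0] * 90 + [1] * 10)
x = pd.DataFrame({'f': range(100)})

# stratify = y keeps the positive rate identical in train and test
x_tr, x_te, y_tr, y_te = sms.train_test_split(x, y, test_size = 0.3, random_state = 1, stratify = y)
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```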
Observations:
# User-defined functions
def get_model_perf_class_sklearn(model, predictors, target):
"""
Description: Function to compute different metrics to check classification model performance
Parameters:
model: sklearn.tree.DecisionTreeClassifier, required
The trained classification model
predictors: pandas.core.frame.DataFrame, required
The DataFrame of the independent variables
target: pandas.core.series.Series, required
The dependent variable
"""
# Predict using the independent variables
pred = model.predict(predictors)
# Compute accuracy
acc = smt.accuracy_score(target, pred)
# Compute recall
recall = smt.recall_score(target, pred)
# Compute precision
precision = smt.precision_score(target, pred)
# Compute F1-score
f1 = smt.f1_score(target, pred)
# Create a DataFrame of metrics
df_perf = pd.DataFrame({"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1, }, index = [0],)
return df_perf
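The metrics computed by `get_model_perf_class_sklearn`, traced on a hypothetical set of eight labels with 2 true positives, 1 false positive, 1 false negative and 4 true negatives:

```python
import sklearn.metrics as smt

# Hypothetical labels: TP = 2, FP = 1, FN = 1, TN = 4
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

acc = smt.accuracy_score(y_true, y_pred)    # (TP + TN) / total = 6 / 8 = 0.75
rec = smt.recall_score(y_true, y_pred)      # TP / (TP + FN) = 2 / 3
prec = smt.precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
print(acc, rec, prec)
```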
def show_confusion_matrix_sklearn(model, predictors, target, figsize = (6, 4)):
"""
Description: Function to plot the confusion matrix with percentages
Parameters:
model: sklearn.tree.DecisionTreeClassifier, required
The trained classification model
predictors: pandas.core.frame.DataFrame, required
The DataFrame of the independent variables
target: pandas.core.series.Series, required
The dependent variable
"""
# Predict using the independent variables
y_pred = model.predict(predictors)
# Create confusion matrix
cm = smt.confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize = figsize)
ax = sns.heatmap(cm, annot = labels, fmt = "")
ax.set_title("Confusion Matrix", fontsize = 11)
plt.ylabel("True label")
plt.xlabel("Predicted label")
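A reminder of scikit-learn's confusion-matrix layout (rows are true labels, columns are predicted labels), shown on hypothetical binary labels:

```python
import sklearn.metrics as smt

# Hypothetical labels: for binary 0/1 targets the matrix is [[TN, FP], [FN, TP]]
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 1]

cm = smt.confusion_matrix(y_true, y_pred)
print(cm.tolist())  # [[2, 1], [0, 2]]
```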
def show_feature_importance(importances, indices, feature_names):
"""
Description: Function to plot the features in the order of importance
Parameters:
importances: numpy.ndarray, required
The impurity-based feature importances of the fitted model
indices: numpy.ndarray, required
The indices of array
feature_names: list, required
The column names of features
"""
plt.figure(figsize = (11.75, 5))
plt.title("Feature Importances", fontsize = 11)
plt.barh(range(len(indices)), importances[indices], color = "violet", align = "center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
def show_decision_tree(model, feature_names, figsize = (20, 10)):
"""
Description: Function to plot the fitted Decision Tree
Parameters:
model: sklearn.tree.DecisionTreeClassifier, required
The trained classification model
feature_names: list, required
The column names of features
figsize: tuple, optional
The figure size in inches, default: (20, 10)
"""
plt.figure(figsize = figsize)
plt.suptitle("Decision Tree", y = 0.9, size = 17)
out = ste.plot_tree(
model,
feature_names = feature_names,
filled = True,
fontsize = 9,
node_ids = False,
class_names = None
)
# Color the arrows of the Decision Tree
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
The model can make two kinds of wrong predictions:
1. Predicting that a customer will purchase the personal loan when the customer will not (False Positive).
2. Predicting that a customer will not purchase the personal loan when the customer will (False Negative).
Which case is more important? A False Negative means losing a potential loan customer, and with it the interest income the bank wants to grow, so here a False Negative is costlier than the marketing expense of a False Positive.
How to reduce this loss (i.e. to reduce False Negatives)? Use Recall as the evaluation metric, since maximizing Recall minimizes False Negatives.
# Create Decision Tree model without class_weight
model_wo_class_weight = ste.DecisionTreeClassifier(random_state = 1)
model_wo_class_weight.fit(x_train, y_train)
# Display the confusion matrix of train data without class_weight
show_confusion_matrix_sklearn(model = model_wo_class_weight, predictors = x_train, target = y_train)
# Display the model performance of train data without class_weight
decision_tree_perf_train_wo_class_weight = get_model_perf_class_sklearn(model = model_wo_class_weight, predictors = x_train, target = y_train)
decision_tree_perf_train_wo_class_weight
# Display the confusion matrix of test data without class_weight
show_confusion_matrix_sklearn(model = model_wo_class_weight, predictors = x_test, target = y_test)
# Display the model performance of test data without class_weight
decision_tree_perf_test_wo_class_weight = get_model_perf_class_sklearn(model = model_wo_class_weight, predictors = x_test, target = y_test)
decision_tree_perf_test_wo_class_weight
# Create Decision Tree model with class_weight
model_w_class_weight = ste.DecisionTreeClassifier(random_state = 1, class_weight = "balanced")
model_w_class_weight.fit(x_train, y_train)
# Display the confusion matrix of train data with class_weight
show_confusion_matrix_sklearn(model = model_w_class_weight, predictors = x_train, target = y_train)
# Display the model performance of train data with class_weight
decision_tree_perf_train_w_class_weight = get_model_perf_class_sklearn(model = model_w_class_weight, predictors = x_train, target = y_train)
decision_tree_perf_train_w_class_weight
# Display the confusion matrix of test data with class_weight
show_confusion_matrix_sklearn(model = model_w_class_weight, predictors = x_test, target = y_test)
# Display the model performance of test data with class_weight
decision_tree_perf_test_w_class_weight = get_model_perf_class_sklearn(model = model_w_class_weight, predictors = x_test, target = y_test)
decision_tree_perf_test_w_class_weight
# Create Decision Tree model with pre-pruning
model_pre_prune = ste.DecisionTreeClassifier(random_state = 1)
# Define hyperparameters
parameters = {
"class_weight": [{0: 0.15, 1: 0.85}],
"max_depth": list(np.arange(1, 15)) + [None],
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
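GridSearchCV expects every key in the grid to map to a list of candidate values; an `np.arange` must be unpacked into that list, otherwise the whole array counts as a single candidate. A smaller, hypothetical grid shows how the combinations are enumerated:

```python
import numpy as np
import sklearn.model_selection as sms

# Each key maps to a LIST of candidate values; unpack arrays into the list
parameters = {
    "max_depth": list(np.arange(1, 4)) + [None],  # 4 candidates: 1, 2, 3, None
    "min_samples_split": [10, 30],                # 2 candidates
}

# ParameterGrid enumerates the Cartesian product of all candidate values
grid = list(sms.ParameterGrid(parameters))
print(len(grid))  # 4 x 2 = 8 combinations
```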
# Type of scoring used to compare parameter combinations (recall, to minimize False Negatives)
recall_scorer = smt.make_scorer(smt.recall_score)
# Run the grid search
grid_obj = sms.GridSearchCV(model_pre_prune, parameters, scoring = recall_scorer, cv = 5)
grid_obj = grid_obj.fit(x_train, y_train)
# Set the model to the best combination of parameters
model_pre_prune = grid_obj.best_estimator_
# Fit the best algorithm to the data
model_pre_prune.fit(x_train, y_train)
# Display the confusion matrix of train data with pre-pruning
show_confusion_matrix_sklearn(model = model_pre_prune, predictors = x_train, target = y_train)
# Display the model performance of train data with pre-pruning
decision_tree_perf_train_pre_prune = get_model_perf_class_sklearn(model = model_pre_prune, predictors = x_train, target = y_train)
decision_tree_perf_train_pre_prune
# Display the confusion matrix of test data with pre-pruning
show_confusion_matrix_sklearn(model = model_pre_prune, predictors = x_test, target = y_test)
# Display the model performance of test data with pre-pruning
decision_tree_perf_test_pre_prune = get_model_perf_class_sklearn(model = model_pre_prune, predictors = x_test, target = y_test)
decision_tree_perf_test_pre_prune
# Create the list of feature names used to render and interpret the tree
feature_names = list(x_train.columns)
feature_names
# Display the Decision Tree with pre-pruning
show_decision_tree(model = model_pre_prune, feature_names = feature_names)
# Display Decision Tree with pre-pruning as text
print(ste.export_text(model_pre_prune, feature_names = feature_names, show_weights = True))
# Importance of features in the tree building
importances_pre_prune = model_pre_prune.feature_importances_
indices_pre_prune = np.argsort(importances_pre_prune)
importances_pre_prune
# Display feature importance with pre-pruning
show_feature_importance(importances = importances_pre_prune, indices = indices_pre_prune, feature_names = feature_names)
# Obtain cost complexity pruning path
model_post_prune = ste.DecisionTreeClassifier(random_state = 1, class_weight = {0: 0.15, 1: 0.85})
path = model_post_prune.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities  # abs() guards against tiny negative alphas caused by floating-point error
# Display effective alphas and total impurities
pd.DataFrame(path)
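The structure of the pruning path can be sanity-checked on synthetic data (a hypothetical `make_classification` dataset): the effective alphas increase along the path and the total leaf impurities never decrease.

```python
import numpy as np
import sklearn.tree as ste
from sklearn.datasets import make_classification

# Hypothetical dataset just to exercise cost_complexity_pruning_path
x_toy, y_toy = make_classification(n_samples = 200, random_state = 1)
path = ste.DecisionTreeClassifier(random_state = 1).cost_complexity_pruning_path(x_toy, y_toy)

# Alphas are sorted ascending; pruning can only increase total leaf impurity
print((np.diff(path.ccp_alphas) >= 0).all(), (np.diff(path.impurities) >= 0).all())
```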
# Display effective alphas with regards to total impurities
fig, ax = plt.subplots(figsize = (10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker = "o", drawstyle = "steps-post")
ax.set_title("Total Impurity vs Effective Alpha for training set")
ax.set_xlabel("Effective Alpha")
ax.set_ylabel("Total Impurity of leaves")
plt.show()
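The plot above reflects a general property of `cost_complexity_pruning_path`: as the effective alpha grows, more of the tree is pruned away, so the total impurity of the leaves grows with it. A small sketch on a public sklearn dataset as a stand-in for the bank data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Public dataset standing in for x_train / y_train
X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# Along the path, effective alphas and total leaf impurities grow together:
# a larger alpha prunes more nodes, leaving impurer leaves behind
print(path.ccp_alphas[:3])
print(path.impurities[:3])
```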
# Create list of Decision Tree model with post-pruning
model_post_prune_list = []
for ccp_alpha in ccp_alphas:
    model_post_prune = ste.DecisionTreeClassifier(random_state = 1, class_weight = {0: 0.15, 1: 0.85}, ccp_alpha = ccp_alpha)
    model_post_prune.fit(x_train, y_train)
    model_post_prune_list.append(model_post_prune)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(model_post_prune_list[-1].tree_.node_count, ccp_alphas[-1]))
# Remove the last element in the models and effective alphas, because it corresponds to the trivial tree with only one node
model_post_prune_list = model_post_prune_list[:-1]
ccp_alphas = ccp_alphas[:-1]
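Dropping the last element is standard here: fitting a tree with the largest effective alpha on the path collapses it to its root, which is useless as a classifier. A quick check on a public dataset (a stand-in for the bank data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# Fitting with the largest effective alpha prunes everything below the root
trivial = DecisionTreeClassifier(random_state=1,
                                 ccp_alpha=path.ccp_alphas[-1]).fit(X, y)
print(trivial.tree_.node_count)
```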
# Display number of nodes and tree depth as alpha increases
node_counts = [model_post_prune.tree_.node_count for model_post_prune in model_post_prune_list]
depth = [model_post_prune.tree_.max_depth for model_post_prune in model_post_prune_list]
fig, ax = plt.subplots(2, 1, figsize = (10, 7))
ax[0].plot(ccp_alphas, node_counts, marker = "o", drawstyle = "steps-post")
ax[0].set_title("Number of nodes vs Alpha", fontsize = 11)
ax[0].set_ylabel("Number of nodes")
ax[0].set_xlabel("Alpha")
ax[1].plot(ccp_alphas, depth, marker = "o", drawstyle = "steps-post")
ax[1].set_title("Depth vs Alpha", fontsize = 11)
ax[1].set_ylabel("Depth of tree")
ax[1].set_xlabel("Alpha")
fig.tight_layout()
# Create list of recall values of train data
recall_train_list = []
for model_post_prune in model_post_prune_list:
    pred_train = model_post_prune.predict(x_train)
    values_train = smt.recall_score(y_train, pred_train)
    recall_train_list.append(values_train)
# Create list of recall values of test data
recall_test_list = []
for model_post_prune in model_post_prune_list:
    pred_test = model_post_prune.predict(x_test)
    values_test = smt.recall_score(y_test, pred_test)
    recall_test_list.append(values_test)
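Recall is the right metric to track here because the cost of missing a likely loan buyer (a false negative) outweighs the cost of contacting a non-buyer. It is the fraction of actual positives the model catches, TP / (TP + FN). A minimal worked example with hypothetical labels:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 4 actual positives, the model finds 2 of them
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# Recall = TP / (TP + FN) = 2 / (2 + 2)
r = recall_score(y_true, y_pred)
print(r)  # -> 0.5
```

Note the false positive in the last position does not affect recall at all; it would only show up in precision.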
# Display the distribution of Alpha with regards to Recall
fig, ax = plt.subplots(figsize = (15, 5))
ax.plot(ccp_alphas, recall_train_list, marker = "o", label = "train", drawstyle = "steps-post")
ax.plot(ccp_alphas, recall_test_list, marker = "o", label = "test", drawstyle = "steps-post")
ax.set_title("Recall vs Alpha for train and test data")
ax.set_ylabel("Recall")
ax.set_xlabel("Alpha")
ax.legend()
plt.show()
# Create best Decision Tree model with post-pruning
index_model_post_prune = np.argmax(recall_test_list)
model_post_prune_best = model_post_prune_list[index_model_post_prune]
model_post_prune_best
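One detail worth knowing about this selection step: `np.argmax` returns the index of the *first* occurrence of the maximum. Since the candidate trees are ordered by increasing alpha, a tie in test recall is resolved in favor of the smaller alpha, i.e. the less aggressively pruned tree. A sketch with hypothetical recall values:

```python
import numpy as np

# Hypothetical test-recall values for four trees along the pruning path
recall_values = [0.80, 0.92, 0.88, 0.92]

# argmax picks the first maximum, so ties favor the earlier (smaller) alpha
best_index = int(np.argmax(recall_values))
print(best_index)  # -> 1, not 3
```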
# Display the best confusion matrix of train data with post-pruning
show_confusion_matrix_sklearn(model = model_post_prune_best, predictors = x_train, target = y_train)
# Display the best model performance of train data with post-pruning
decision_tree_perf_train_post_prune = get_model_perf_class_sklearn(model = model_post_prune_best, predictors = x_train, target = y_train)
decision_tree_perf_train_post_prune
# Display the best confusion matrix of test data with post-pruning
show_confusion_matrix_sklearn(model = model_post_prune_best, predictors = x_test, target = y_test)
# Display the best model performance of test data with post-pruning
decision_tree_perf_test_post_prune = get_model_perf_class_sklearn(model = model_post_prune_best, predictors = x_test, target = y_test)
decision_tree_perf_test_post_prune
# Display the best Decision Tree with post-pruning
show_decision_tree(model = model_post_prune_best, feature_names = feature_names)
# Display the best Decision Tree with post-pruning as text
print(ste.export_text(model_post_prune_best, feature_names = feature_names, show_weights = True))
# Importance of features in the tree building
importances_post_prune = model_post_prune_best.feature_importances_
indices_post_prune = np.argsort(importances_post_prune)
importances_post_prune
# Display feature importance with post-pruning
show_feature_importance(importances = importances_post_prune, indices = indices_post_prune, feature_names = feature_names)
# Display training performance comparison
df_model_train_comp = pd.concat([
decision_tree_perf_train_wo_class_weight.transpose(),
decision_tree_perf_train_w_class_weight.transpose(),
decision_tree_perf_train_pre_prune.transpose(),
decision_tree_perf_train_post_prune.transpose(),
],
axis = 1
)
df_model_train_comp.columns = [
"Decision Tree without class_weight",
"Decision Tree with class_weight",
"Decision Tree (Pre-pruning)",
"Decision Tree (Post-pruning)",
]
df_model_train_comp
# Display testing performance comparison
df_model_test_comp = pd.concat([
decision_tree_perf_test_wo_class_weight.transpose(),
decision_tree_perf_test_w_class_weight.transpose(),
decision_tree_perf_test_pre_prune.transpose(),
decision_tree_perf_test_post_prune.transpose(),
],
axis = 1,
)
df_model_test_comp.columns = [
"Decision Tree without class_weight",
"Decision Tree with class_weight",
"Decision Tree (Pre-pruning)",
"Decision Tree (Post-pruning)",
]
df_model_test_comp
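The transpose-then-concat pattern above turns each one-row performance frame into a column of metrics, so the four models line up side by side with metrics as rows. A minimal sketch with hypothetical metric values (the helper's real output columns may differ):

```python
import pandas as pd

# Hypothetical one-row performance frames, as returned by the helper
perf_a = pd.DataFrame({"Accuracy": [0.97], "Recall": [0.90]})
perf_b = pd.DataFrame({"Accuracy": [0.98], "Recall": [0.94]})

# Transposing makes metrics the index; concat on axis=1 puts models side by side
comp = pd.concat([perf_a.transpose(), perf_b.transpose()], axis=1)
comp.columns = ["Model A", "Model B"]
print(comp)
```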
The Decision Tree (Post-pruning) model can predict, 98.66% of the time, that a customer will accept a loan if the customer meets any of the following conditions:
OR:
OR: